# Cross-modal Pretraining

| Model | License | Description | Tags | Publisher | Downloads | Likes |
|-------|---------|-------------|------|-----------|----------:|------:|
| Vit Large Patch16 Siglip 512.v2 Webli | Apache-2.0 | ViT image encoder based on SigLIP 2, designed for timm and suitable for vision-language tasks | Image Classification · Transformers | timm | 295 | 0 |
| Vit Base Patch16 Siglip 256.webli I18n | Apache-2.0 | ViT-B/16 vision Transformer based on SigLIP, containing only the image encoder and using the original attention pooling | Image Classification · Transformers | timm | 16 | 0 |
| Speecht5 Tts Hr | MIT | SpeechT5 text-to-speech model fine-tuned for Croatian, built on Microsoft's SpeechT5 architecture and trained on the VoxPopuli dataset | Speech Synthesis · Transformers · Other | nikolab | 124 | 1 |
| Speecht5 Asr | MIT | SpeechT5 automatic speech recognition model fine-tuned on the LibriSpeech dataset, supporting speech-to-text conversion | Speech Recognition · Transformers | microsoft | 12.30k | 41 |
| Xclip Base Patch16 Hmdb 8 Shot | MIT | X-CLIP, an extension of CLIP for general video-language understanding, trained via contrastive learning on video-text pairs; suitable for video classification and video-text retrieval | Text-to-Video · Transformers · English | microsoft | 17 | 1 |
| Unixcoder Base Nine | Apache-2.0 | UniXcoder, a unified cross-modal pretrained model that leverages multimodal data such as code comments and abstract syntax trees to pretrain code representations | Multimodal Fusion · Transformers · English | microsoft | 17.35k | 19 |
| Unixcoder Base | Apache-2.0 | UniXcoder, a unified cross-modal pretrained model that leverages multimodal data such as code comments and abstract syntax trees to pretrain code representations | Multimodal Fusion · Transformers · English | microsoft | 347.45k | 51 |
© 2025 AIbase